{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0200b94f-9e92-41f1-ae08-a47d5e006789",
   "metadata": {},
   "source": [
    "# Fraud Detection using LightGBM and SMOTE\n",
    "\n",
    "### Introduction\n",
    "We use the dataset of [ULB Credit Card Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to train our frauld detection model. In this notebook, we will train our model using [LighGBM](https://microsoft.github.io/SynapseML/docs/next/features/lightgbm/LightGBM%20-%20Overview/) and [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) algorithms, which refers to some execellent work listed as below:\n",
    "\n",
    "* **Fraud detection handbook**: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html\n",
    "* **AWS creditcard fraud detector**: https://github.com/awslabs/fraud-detection-using-machine-learning/blob/master/source/notebooks/sagemaker_fraud_detection.ipynb\n",
    "* **Creditcard fraud detection predictive models**: https://www.kaggle.com/code/gpreda/credit-card-fraud-detection-predictive-models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca99fa98-39b7-4abc-8f3f-ee6835764e61",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "import yaml\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "\n",
    "from pyspark.sql import functions as F\n",
    "from pyspark.sql.types import FloatType, DoubleType\n",
    "from pyspark.ml.feature import VectorAssembler\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.mllib.evaluation import MulticlassMetrics\n",
    "\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "pd.set_option('display.max_rows', None)\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f948ed-661c-4409-be2d-78449e7adc60",
   "metadata": {},
   "outputs": [],
   "source": [
    "def init_spark():\n",
    "    spark = pyspark.sql.SparkSession.builder\\\n",
    "            .appName(\"Fraud Detection-LightGBM\") \\\n",
    "            .config(\"spark.executor.memory\",\"8G\") \\\n",
    "            .config(\"spark.executor.instances\",\"4\") \\\n",
    "            .config(\"spark.executor.cores\", \"4\") \\\n",
    "            .config(\"spark.jars.packages\", \"com.microsoft.azure:synapseml_2.12:0.9.4\") \\\n",
    "            .config(\"spark.jars.repositories\", \"https://mmlspark.azureedge.net/maven\") \\\n",
    "            .getOrCreate()\n",
    "    sc = spark.sparkContext\n",
    "    print(sc.version)\n",
    "    print(sc.applicationId)\n",
    "    print(sc.uiWebUrl)\n",
    "    return spark\n",
    "\n",
    "def load_config(path):\n",
    "    params = dict()\n",
    "    with open(path, 'r') as stream:\n",
    "        params = yaml.load(stream, Loader=yaml.FullLoader)\n",
    "    return params\n",
    "\n",
    "def read_dataset(spark, data_path):\n",
    "    dataset = spark.read.format(\"csv\")\\\n",
    "      .option(\"header\",  True)\\\n",
    "      .option(\"inferSchema\",  True)\\\n",
    "      .load(data_path)  \n",
    "    return dataset\n",
    "\n",
    "def get_vectorassembler(dataset, features='features', label='label'):\n",
    "    featurizer = VectorAssembler(\n",
    "        inputCols = feature_cols,\n",
    "        outputCol = 'features',\n",
    "        handleInvalid = 'skip'\n",
    "    )\n",
    "    dataset = featurizer.transform(dataset)[label, features]\n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93c15f9b-cd7f-41a2-ad57-3e56c60aa89a",
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = init_spark()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3470f0b-888e-4ea7-85c4-da5544e601a1",
   "metadata": {},
   "source": [
    "### Train detection model using LightGBM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e9da426",
   "metadata": {},
   "source": [
    "You should replace ``{MY_S3_BUCKET}``, ``{TRAIN_S3_PATH}`` and ``{TEST_S3_PATH}`` with actual values before executing code cells containing these placeholders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42e6219f-ba85-48fc-bbce-f1fafb847b94",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_file_path = 's3://{MY_S3_BUCKET}/{TRAIN_S3_PATH}'\n",
    "test_file_path = 's3://{MY_S3_BUCKET}/{TEST_S3_PATH}'\n",
    "fg_train_dataset = read_dataset(spark, train_file_path)\n",
    "fg_test_dataset = read_dataset(spark, test_file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e3c9e34-2fa2-46d0-9ee8-7c3d660a9e4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "fg_train_dataset.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6536baa7-e0ec-4fe4-983b-cd952329ee37",
   "metadata": {},
   "source": [
    "The dataset contains only numerical features, because the original features have been transformed using PCA. As a result, the dataset contains 28 PCA components, V1-V28, and two features that haven't been transformed, `Amount` and `Time`. Amount refers to the transaction amount, and Time is the seconds elapsed between any transaction in the data and the first transaction.Moreover, The `Class` column corresponds to whether or not a transaction is fraudulent. \n",
    "\n",
    "\n",
    "> https://github.com/awslabs/fraud-detection-using-machine-learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb8c412d-b5a4-49e7-b379-41498f32ec46",
   "metadata": {},
   "outputs": [],
   "source": [
    "fg_train_dataset.groupby('Class').count().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8920c941-29df-4c61-83d3-9666d06cdc97",
   "metadata": {},
   "outputs": [],
   "source": [
    "fg_test_dataset.groupby('Class').count().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bafae023-f0b0-482a-9117-f0244e4096d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_cols = fg_train_dataset.columns[:-1]\n",
    "feature_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60261600-2b32-425e-8bf1-66e3bf3da41f",
   "metadata": {},
   "outputs": [],
   "source": [
    "label_col = fg_train_dataset.columns[-1]\n",
    "label_col"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4e49a06-ec27-47f1-8eb0-bc55da65b94d",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = get_vectorassembler(fg_train_dataset, label=label_col, features='features')\n",
    "test_data = get_vectorassembler(fg_test_dataset, label=label_col, features='features')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7b4e6ba-ad56-4caa-a678-245c0462f93e",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37bafe3c-c93b-446e-a9c7-213f0764d917",
   "metadata": {},
   "source": [
    "We split train dataset for model training and validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff73d43c-4bd3-40bd-a440-db665a37158f",
   "metadata": {},
   "outputs": [],
   "source": [
    "train, valid = train_data.randomSplit([0.90, 0.10], seed=2022)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30959f13-12b1-4d80-8d04-448d30ea4030",
   "metadata": {},
   "outputs": [],
   "source": [
    "train.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf11325b-4904-4b35-ae20-95ea7a893512",
   "metadata": {},
   "outputs": [],
   "source": [
    "valid.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e59980e-f80d-47bb-ba8c-9de486d5f0ed",
   "metadata": {},
   "source": [
    "### Train model\n",
    "Here we are using `isUnbalance=True`, please refer to [LightGBM docs](https://mmlspark.blob.core.windows.net/docs/0.18.1/pyspark/mmlspark.lightgbm.html) for the description of this parameter. Moreover, we will test the model performance by using multiple metrics, such as `AUC`, `KS`, `Balanced accuracy`, `Cohen's kappa` and `Confusion Matrix`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72a56474-a0a9-4f82-823d-3c62c1642396",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_params =  {\n",
    "    'boostingType':'gbdt',\n",
    "    'objective':'binary',\n",
    "    'metric':'auc',\n",
    "    'numLeaves': 7,\n",
    "    'maxDepth': 4,\n",
    "    'minDataInLeaf': 100,\n",
    "    'maxBin': 100,\n",
    "    'minGainToSplit': 0.0,\n",
    "    'featureFraction': 0.7,\n",
    "    'baggingFraction': 0.9,\n",
    "    'baggingFreq': 1,\n",
    "    'learningRate': 0.01,\n",
    "    'numIterations': 300,\n",
    "    'earlyStoppingRound': 100,\n",
    "    'verbosity':1,\n",
    "    'numThreads':16,\n",
    "}\n",
    "\n",
    "def train_lightgbm(train_dataset, feature_col, label_col, model_params):\n",
    "    from synapse.ml.lightgbm import LightGBMClassifier\n",
    "    model = LightGBMClassifier(isProvideTrainingMetric=True, featuresCol=feature_col, labelCol=label_col, isUnbalance=True, **model_params)\n",
    "    model = model.fit(train_dataset)\n",
    "    return model\n",
    "\n",
    "def evaluate(predictions, label_col, metricName=\"areaUnderROC\"):\n",
    "    evaluator = BinaryClassificationEvaluator(labelCol=label_col, metricName=\"areaUnderROC\")\n",
    "    return evaluator.evaluate(predictions)\n",
    "\n",
    "model = train_lightgbm(train, 'features', 'Class', model_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5acd2a72-5fcb-41df-ba09-e5d2f478febc",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"train dataset prediciton:\")\n",
    "predictions = model.transform(train)\n",
    "print(\"train dataset auc:\", evaluate(predictions, label_col))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df0ef43d-8fa8-4f36-b8f7-31402442d43f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"validation dataset prediciton:\")\n",
    "predictions = model.transform(valid)\n",
    "print(\"validation dataset auc:\", evaluate(predictions, label_col))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab1dc8e2-0dd6-4b39-a3ca-34afdcaf8fe5",
   "metadata": {},
   "outputs": [],
   "source": [
    "importance_df = (\n",
    "    pd.DataFrame({\n",
    "        'feature_name': feature_cols,\n",
    "        'importance_gain': model.getFeatureImportances('gain'),\n",
    "        'importance_split': model.getFeatureImportances('split'),\n",
    "    })\n",
    "    .sort_values('importance_gain', ascending=False)\n",
    "    .reset_index(drop=True)\n",
    ")\n",
    "print(importance_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b23b82dc-fbb5-479b-9c19-cde921c38d27",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = train_lightgbm(train_data, 'features', 'Class', model_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "253eeb61-c859-46a1-912a-e811b5bdd624",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"test dataset prediciton:\")\n",
    "predictions = model.transform(test_data)\n",
    "print(\"test dataset auc:\", evaluate(predictions, label_col))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "240e4d3d-d9b9-4ac9-a5cd-e9b1e5f82f9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictionAndLabels = predictions.select('prediction', F.col(label_col).cast(DoubleType()))\\\n",
    "                                 .withColumnRenamed(label_col, 'label')\n",
    "metrics = MulticlassMetrics(predictionAndLabels.rdd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3863cfc7-6995-48b5-9777-6f043f619a26",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sn\n",
    "plt.figure(figsize = (10,7))\n",
    "sn.heatmap(metrics.confusionMatrix().toArray(), \n",
    "           xticklabels=['Not Fraud', 'Fraud'],\n",
    "           yticklabels=['Not Fraud', 'Fraud'],\n",
    "           linewidths=5, fmt='g', annot=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1a26fdb-8b1b-4c4d-b88e-b75da4cca9ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "sys.path.append('../../') \n",
    "from common.ks_utils import ks_2samp, ks_curve\n",
    "\n",
    "label = np.array(predictions.select(label_col).collect()).reshape(-1).astype(np.float32)\n",
    "prediction = np.array(predictions.select('probability').collect())[:, :, 1].reshape(-1)\n",
    "print('label: ', label[0:10])\n",
    "print('prediction: ', prediction[0:10])\n",
    "\n",
    "ks = ks_2samp(label, prediction)\n",
    "print(\"KS statistic: \", ks.statistic)\n",
    "ks_curve(label, prediction)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c55034c-ca4c-4a28-9117-d62410c388ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n",
    "\n",
    "# scikit-learn expects 0/1 predictions, so we threshold our raw predictions\n",
    "y_preds = np.where(prediction > 0.5, 1, 0)\n",
    "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(label, y_preds)))\n",
    "print(\"Cohen's Kappa = {}\".format(cohen_kappa_score(label, y_preds)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2955ea59-5ded-47d9-af92-2a6ec8759a62",
   "metadata": {},
   "source": [
    "### SMOTE\n",
    "Now we have a baseline model using LightGBM. Let us will verify the over sampling tricks on traning data that using SMOTE. Moreover, we will test the model performance by using multiple metrics, such as `AUC`, `KS`, `Balanced accuracy`, `Cohen's kappa` and `Confusion Matrix`. You should replace ``{MY_S3_BUCKET}``, ``{TRAIN_SMOTE_S3_PATH}`` with actual values before executing code cells containing these placeholders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "448958c6-bd2d-447e-932b-85001a3bc5d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_file_path = 's3://{MY_S3_BUCKET}/{TRAIN_SMOTE_S3_PATH}'\n",
    "fg_train_smote_dataset = read_dataset(spark, train_file_path)\n",
    "train_smote_data = get_vectorassembler(fg_train_smote_dataset, label=label_col, features='features')\n",
    "train_smote_data.groupby('Class').count().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c816424e-f9ff-4685-8eee-3886d65e8736",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = train_lightgbm(train_smote_data, 'features', 'Class', model_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06fc73da-ff71-4b17-9ec9-3f4926af9f6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"test dataset prediciton:\")\n",
    "predictions = model.transform(test_data)\n",
    "print(\"test dataset auc:\", evaluate(predictions, label_col))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2858ecd3-84fd-429f-b11d-77cda07c1f18",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictionAndLabels = predictions.select('prediction', F.col(label_col).cast(DoubleType()))\\\n",
    "                                 .withColumnRenamed(label_col, 'label')\n",
    "metrics = MulticlassMetrics(predictionAndLabels.rdd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef65ab00-dbd6-410a-9ffc-e8256529b5f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sn\n",
    "plt.figure(figsize = (10,7))\n",
    "sn.heatmap(metrics.confusionMatrix().toArray(), \n",
    "           xticklabels=['Not Fraud', 'Fraud'],\n",
    "           yticklabels=['Not Fraud', 'Fraud'],\n",
    "           linewidths=5, fmt='g', annot=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dfcd6c1-e752-459a-b68c-98063f414f14",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "sys.path.append('../../') \n",
    "from common.ks_utils import ks_2samp, ks_curve\n",
    "\n",
    "label = np.array(predictions.select(label_col).collect()).reshape(-1).astype(np.float32)\n",
    "prediction = np.array(predictions.select('probability').collect())[:, :, 1].reshape(-1)\n",
    "print('label: ', label[0:10])\n",
    "print('prediction: ', prediction[0:10])\n",
    "\n",
    "ks = ks_2samp(label, prediction)\n",
    "print(\"KS statistic: \", ks.statistic)\n",
    "ks_curve(label, prediction)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70f61b52-e368-416a-886f-34808660c234",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n",
    "\n",
    "# scikit-learn expects 0/1 predictions, so we threshold our raw predictions\n",
    "y_preds = np.where(prediction > 0.5, 1, 0)\n",
    "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(label, y_preds)))\n",
    "print(\"Cohen's Kappa = {}\".format(cohen_kappa_score(label, y_preds)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de33028a-0155-43ca-a35e-34488c23905d",
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.stop()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a806243a-bc5e-492d-975f-cb29dc94ac1d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
